Procrustes distance
When Embedding Models Meet: Procrustes Bounds and Applications
Maystre, Lucas, Gonzalez, Alvaro Ortega, Park, Charles, Dolga, Rares, Berariu, Tudor, Zhao, Yu, Ciosek, Kamil
Embedding models trained separately on similar data often produce representations that encode stable information but are not directly interchangeable. This lack of interoperability raises challenges in several practical applications, such as model retraining, partial model upgrades, and multimodal search. Driven by these challenges, we study when two sets of embeddings can be aligned by an orthogonal transformation. We show that if pairwise dot products are approximately preserved, then there exists an isometry that closely aligns the two sets, and we provide a tight bound on the alignment error. This insight yields a simple alignment recipe, Procrustes post-processing, that makes two embedding models interoperable while preserving the geometry of each embedding space. Empirically, we demonstrate its effectiveness in three applications: maintaining compatibility across retrainings, combining different models for text retrieval, and improving mixed-modality search, where it achieves state-of-the-art performance.
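The alignment by an orthogonal transformation described in this abstract can be illustrated with the classical orthogonal Procrustes solution. The following is a minimal sketch, not necessarily the authors' exact recipe: it assumes the two models embed the same items as rows of `X` and `Y` with equal dimension, and the function name `procrustes_align` and the synthetic data are hypothetical.

```python
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal matrix Q minimizing ||X @ Q - Y||_F."""
    # With X^T Y = U S V^T, the minimizer over orthogonal Q is U V^T.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic stand-in for two embedding models (assumption: same items, same width).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
Q_true, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal map
Y = X @ Q_true + 0.01 * rng.normal(size=(100, 8))   # rotated copy plus small noise

Q = procrustes_align(X, Y)
err_before = np.linalg.norm(X - Y)       # misalignment without post-processing
err_after = np.linalg.norm(X @ Q - Y)    # misalignment after the orthogonal map
```

Because `Q` is orthogonal, `X @ Q` has exactly the same pairwise dot products as `X`, which is the sense in which such post-processing preserves the geometry of the embedding space.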
Knowledge distillation through geometry-aware representational alignment
Bhattarai, Prajjwal, Amjad, Mohammad, Zhylko, Dmytro, Alhanai, Tuka
Knowledge distillation is a common paradigm for transferring capabilities from larger models to smaller ones. While traditional distillation methods leverage a probabilistic divergence over the output of the teacher and student models, feature-based distillation methods often minimize variants of Euclidean norms between the hidden layer representations. The main goal is for the student to mimic the structure of the feature space of the teacher. In this work, we theoretically show that existing feature distillation methods, such as projection-based mean squared loss or Centered Kernel Alignment (CKA), cannot capture the feature structure, even under zero loss. We then motivate the use of Procrustes distance and the Frobenius norm of Feature Gram Matrix, distances already common in the context of measuring representational alignment, as distillation losses. We show that feature distillation through our method yields statistically significant improvements in distillation performance across language model families (BERT and OPT) in classification and instruction-following tasks by up to 2 percentage points, demonstrating the potential of integrating feature geometry into existing distillation methods. While large models are achieving state-of-the-art results across almost all vision and language tasks, the emergent abilities these models exhibit (Wei et al., 2022; Liang et al., 2023b) are often inaccessible to the public as a result of their inherent size and operating costs. Knowledge Distillation (KD) is one of the many paradigms that aim to bridge the gap between size and performance by inducing ways of transferring knowledge and abilities from a larger, complex model (teacher) to a smaller and accessible model (student).
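The two distances named in this abstract can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's training loss: it assumes teacher and student features for the same inputs are rows of equally wide matrices, omits any centering or normalization the authors may apply, and the function names are hypothetical.

```python
import numpy as np

def procrustes_distance(X, Y):
    """min over orthogonal Q of ||X - Y @ Q||_F, via the nuclear norm of X^T Y."""
    nuclear = np.linalg.svd(X.T @ Y, compute_uv=False).sum()
    squared = np.sum(X**2) + np.sum(Y**2) - 2.0 * nuclear
    return np.sqrt(max(squared, 0.0))   # guard against tiny negative round-off

def gram_distance(X, Y):
    """Frobenius norm of the difference between feature Gram matrices."""
    return np.linalg.norm(X @ X.T - Y @ Y.T)

rng = np.random.default_rng(0)
teacher = rng.normal(size=(50, 6))
R, _ = np.linalg.qr(rng.normal(size=(6, 6)))        # random orthogonal matrix
rotated = teacher @ R                               # same geometry, new basis
stretched = teacher * np.linspace(0.5, 2.0, 6)      # per-axis rescaling changes geometry

d_rot = (procrustes_distance(teacher, rotated), gram_distance(teacher, rotated))
d_str = (procrustes_distance(teacher, stretched), gram_distance(teacher, stretched))
```

Both quantities are approximately zero for the rotated copy and clearly positive for the stretched copy, which illustrates that they respond to changes in feature geometry while ignoring the choice of basis.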
The Differential Meaning of Models: A Framework for Analyzing the Structural Consequences of Semantic Modeling Decisions
Stine, Zachary K., Deitrick, James E.
The proliferation of methods for modeling of human meaning-making constitutes a powerful class of instruments for the analysis of complex semiotic systems. However, the field lacks a general theoretical framework for describing these modeling practices across various model types in an apples-to-apples way. In this paper, we propose such a framework grounded in the semiotic theory of C. S. Peirce. We argue that such models measure latent symbol geometries, which can be understood as hypotheses about the complex of semiotic agencies underlying a symbolic dataset. Further, we argue that in contexts where a model's value cannot be straightforwardly captured by proxy measures of performance, models can instead be understood relationally, so that the particular interpretive lens of a model becomes visible through its contrast with other models. This forms the basis of a theory of model semantics in which models, and the modeling decisions that constitute them, are themselves treated as signs. In addition to proposing the framework, we illustrate its empirical use with a few brief examples and consider foundational questions and future directions enabled by the framework.
Can Biologically Plausible Temporal Credit Assignment Rules Match BPTT for Neural Similarity? E-prop as an Example
Liu, Yuhan Helena, Yang, Guangyu Robert, Cueva, Christopher J.
Understanding how the brain learns may be informed by studying biologically plausible learning rules. These rules, often approximating gradient descent learning to respect biological constraints such as locality, must meet two critical criteria to be considered an appropriate brain model: (1) good neuroscience task performance and (2) alignment with neural recordings. While extensive research has assessed the first criterion, the second remains underexamined. Employing methods such as Procrustes analysis on well-known neuroscience datasets, this study demonstrates the existence of a biologically plausible learning rule -- namely e-prop, which is based on gradient truncation and has demonstrated versatility across a wide range of tasks -- that can achieve neural data similarity comparable to Backpropagation Through Time (BPTT) when matched for task accuracy. Our findings also reveal that model architecture and initial conditions can play a more significant role in determining neural similarity than the specific learning rule. Furthermore, we observe that BPTT-trained models and their biologically plausible counterparts exhibit similar dynamical properties at comparable accuracies. These results underscore the substantial progress made in developing biologically plausible learning rules, highlighting their potential to achieve both competitive task performance and neural data similarity.
What Representational Similarity Measures Imply about Decodable Information
Harvey, Sarah E., Lipshutz, David, Williams, Alex H.
Neural responses encode information that is useful for a variety of downstream tasks. A common approach to understanding these systems is to build regression models or "decoders" that reconstruct features of the stimulus from neural responses. Popular neural network similarity measures like centered kernel alignment (CKA), canonical correlation analysis (CCA), and Procrustes shape distance do not explicitly leverage this perspective and instead highlight geometric invariances to orthogonal or affine transformations when comparing representations. Here, we show that many of these measures can, in fact, be equivalently motivated from a decoding perspective. Specifically, measures like CKA and CCA quantify the average alignment between optimal linear readouts across a distribution of decoding tasks. We also show that the Procrustes shape distance upper bounds the distance between optimal linear readouts and that the converse holds for representations with low participation ratio. Overall, our work demonstrates a tight link between the geometry of neural representations and the ability to linearly decode information. This perspective suggests new ways of measuring similarity between neural systems and also provides novel, unifying interpretations of existing measures.
Resampling and averaging coordinates on data
Blumberg, Andrew J., Carriere, Mathieu, Fung, Jun Hou, Mandell, Michael A.
We introduce algorithms for robustly computing intrinsic coordinates on point clouds. Our approach relies on generating many candidate coordinates by subsampling the data and varying hyperparameters of the embedding algorithm (e.g., manifold learning). We then identify a subset of representative embeddings by clustering the collection of candidate coordinates and using shape descriptors from topological data analysis. The final output is the embedding obtained as an average of the representative embeddings using generalized Procrustes analysis.
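The averaging step mentioned at the end of this abstract can be illustrated with a basic form of generalized Procrustes analysis. The sketch below is an assumption-laden simplification, not the authors' algorithm: it assumes every candidate embedding covers the same points in the same row order, uses rotations and reflections only (no scaling), and the function name `generalized_procrustes` is hypothetical.

```python
import numpy as np

def generalized_procrustes(embeddings, n_iter=10):
    """Alternate between aligning each embedding to the mean and updating the mean."""
    centered = [E - E.mean(axis=0) for E in embeddings]
    mean = centered[0]                       # initialize with the first embedding
    for _ in range(n_iter):
        aligned = []
        for E in centered:
            # Orthogonal Procrustes: best Q with E @ Q close to the current mean.
            U, _, Vt = np.linalg.svd(E.T @ mean)
            aligned.append(E @ (U @ Vt))
        mean = np.mean(aligned, axis=0)
    return mean, aligned

# Synthetic candidates: noisy, randomly rotated copies of one 2-D point cloud.
rng = np.random.default_rng(0)
base = rng.normal(size=(40, 2))
base -= base.mean(axis=0)
candidates = []
for _ in range(5):
    Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))
    candidates.append(base @ Q + 0.01 * rng.normal(size=base.shape))

average, _ = generalized_procrustes(candidates)

# Compare the average with the ground truth, up to an orthogonal map.
U, _, Vt = np.linalg.svd(average.T @ base)
residual = np.linalg.norm(average @ (U @ Vt) - base)
```

The residual stays at the scale of the injected noise, since averaging after alignment cancels the arbitrary rotation of each candidate rather than blurring the shape.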
Estimating Shape Distances on Neural Representations with Limited Samples
Pospisil, Dean A., Larsen, Brett W., Harvey, Sarah E., Williams, Alex H.
Measuring geometric similarity between high-dimensional network representations is a topic of longstanding interest to neuroscience and deep learning. Although many methods have been proposed, only a few works have rigorously analyzed their statistical efficiency or quantified estimator uncertainty in data-limited regimes. Here, we derive upper and lower bounds on the worst-case convergence of standard estimators of shape distance, a measure of representational dissimilarity proposed by Williams et al. (2021). These bounds reveal the challenging nature of the problem in high-dimensional feature spaces. To overcome these challenges, we introduce a new method-of-moments estimator with a tunable bias-variance tradeoff. We show that this estimator achieves substantially lower bias than standard estimators in simulation and on neural data, particularly in high-dimensional settings. Thus, we lay the foundation for a rigorous statistical theory for high-dimensional shape analysis, and we contribute a new estimation method that is well-suited to practical scientific settings.
Duality of Bures and Shape Distances with Implications for Comparing Neural Representations
Harvey, Sarah E., Larsen, Brett W., Williams, Alex H.
A multitude of (dis)similarity measures between neural network representations have been proposed, resulting in a fragmented research landscape. Most of these measures fall into one of two categories. First, measures such as linear regression, canonical correlation analysis (CCA), and shape distances all learn explicit mappings between neural units to quantify similarity while accounting for expected invariances. Second, measures such as representational similarity analysis (RSA), centered kernel alignment (CKA), and normalized Bures similarity (NBS) all quantify similarity in summary statistics, such as stimulus-by-stimulus kernel matrices, which are already invariant to expected symmetries. Here, we take steps towards unifying these two broad categories of methods by observing that the cosine of the Riemannian shape distance (from category 1) is equal to NBS (from category 2). We explore how this connection leads to new interpretations of shape distances and NBS, and draw contrasts of these measures with CKA, a popular similarity measure in the deep learning literature.
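The stated equality can be checked numerically. The sketch below assumes the commonly used definitions, which may differ in detail from the paper's: linear kernels K = X Xᵀ on centered responses, NBS as the fidelity Tr[(K_X^{1/2} K_Y K_X^{1/2})^{1/2}] divided by sqrt(Tr K_X · Tr K_Y), and the cosine of the shape distance as the nuclear norm of Xᵀ Y divided by the product of Frobenius norms. All function names are hypothetical.

```python
import numpy as np

def psd_sqrt(K):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(K)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def nbs(KX, KY):
    """Normalized Bures similarity between two kernel matrices (category 2)."""
    S = psd_sqrt(KX)
    M = S @ KY @ S
    w = np.linalg.eigvalsh((M + M.T) / 2.0)          # symmetrize against round-off
    fidelity = np.sqrt(np.clip(w, 0.0, None)).sum()
    return fidelity / np.sqrt(np.trace(KX) * np.trace(KY))

def cos_shape_distance(X, Y):
    """Cosine of the angular shape distance after optimal rotation (category 1)."""
    nuclear = np.linalg.svd(X.T @ Y, compute_uv=False).sum()
    return nuclear / (np.linalg.norm(X) * np.linalg.norm(Y))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))        # 30 stimuli, 5 units
Y = rng.normal(size=(30, 5))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

from_kernels = nbs(X @ X.T, Y @ Y.T)
from_alignment = cos_shape_distance(X, Y)
```

The two values agree up to floating-point error, even though one is computed from stimulus-by-stimulus kernels and the other from an explicit optimization over rotations of the unit space.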
Soft Matching Distance: A metric on neural representations that captures single-neuron tuning
Khosla, Meenakshi, Williams, Alex H.
Common measures of neural representational (dis)similarity are designed to be insensitive to rotations and reflections of the neural activation space. Motivated by the premise that the tuning of individual units may be important, there has been recent interest in developing stricter notions of representational (dis)similarity that require neurons to be individually matched across networks. When two networks have the same size (i.e. same number of neurons), a distance metric can be formulated by optimizing over neuron index permutations to maximize tuning curve alignment. However, it is not clear how to generalize this metric to measure distances between networks with different sizes. Here, we leverage a connection to optimal transport theory to derive a natural generalization based on "soft" permutations. The resulting metric is symmetric, satisfies the triangle inequality, and can be interpreted as a Wasserstein distance between two empirical distributions. Further, our proposed metric avoids counter-intuitive outcomes suffered by alternative approaches, and captures complementary geometric insights into neural representations that are entirely missed by rotation-invariant metrics.
Robust Unsupervised Learning of Temporal Dynamic Interactions
Guha, Aritra, Lei, Rayleigh, Zhu, Jiacheng, Nguyen, XuanLong, Zhao, Ding
Robust representation learning of temporal dynamic interactions is an important problem in robotic learning in general and automated unsupervised learning in particular. Temporal dynamic interactions can be described by (multiple) geometric trajectories in a suitable space over which unsupervised learning techniques may be applied to extract useful features from raw and high-dimensional data measurements. Taking a geometric approach to robust representation learning for temporal dynamic interactions, it is necessary to develop suitable metrics and a systematic methodology for comparison and for assessing the stability of an unsupervised learning method with respect to its tuning parameters. Such metrics must account for the (geometric) constraints in the physical world as well as the uncertainty associated with the learned patterns. In this paper we introduce a model-free metric based on the Procrustes distance for robust representation learning of interactions, and an optimal transport based distance metric for comparing distributions of interaction primitives. These distance metrics can serve as an objective for assessing the stability of an interaction learning algorithm. They are also used for comparing the outcomes produced by different algorithms. Moreover, they may also be adopted as an objective function to obtain clusters and representative interaction primitives. These concepts and techniques will be introduced, along with mathematical properties, while their usefulness will be demonstrated in unsupervised learning of vehicle-to-vehicle interactions extracted from the Safety Pilot database, the world's largest database for connected vehicles.